superheat: An R package for generating supervised heatmaps
The superheat package was developed to produce supervised heatmaps, a data exploration tool for visualizing complex datasets. The highly customisable image combines a clustered heatmap with scatterplots, with the aim of both visualizing the information contained in the data and assessing the adequacy of models fit to it.
The goal of this guide is to help you use the superheat package in R to explore your data and assess a model of interest. First, you need to install the package, which is done using the devtools package. devtools allows you to download and install packages hosted on GitHub, such as this one. On most systems, type the following into your R console (the first line can be skipped if devtools is already installed):
install.packages("devtools")
devtools::install_github("rlbarter/superheat")
Next, load the superheat library into your workspace:
library(superheat)
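If you rerun this setup often, it can help to install only what is missing. The following is a minimal sketch rather than part of the package: `needs_install` is a hypothetical helper name, and the `interactive()` guard is an assumption made here so that nothing gets installed from a non-interactive script.

```r
# Hypothetical helper: TRUE if the named package cannot be loaded.
needs_install <- function(pkg) {
  !requireNamespace(pkg, quietly = TRUE)
}

# Only attempt installation from an interactive session (assumption:
# you do not want a batch script to trigger downloads).
if (interactive()) {
  if (needs_install("devtools")) install.packages("devtools")
  if (needs_install("superheat")) devtools::install_github("rlbarter/superheat")
}
```

After this runs in an interactive session, `library(superheat)` should succeed as shown above.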
As mentioned above, the primary aim of this package is to produce a supervised heatmap that can be used both to explore your data and to diagnose areas of your data where your model may be performing sub-optimally (and to give clues about possible improvements). The plot consists of three elements:
a heatmap: a (possibly clustered) matrix \({\bf X}\).
plots above and to the right of the heatmap: these could be scatterplots, barplots, scatterplots with a smoothed curve, isolated smoothed curves or line plots. The entries of the top and right plots correspond to the columns and rows of the heatmap, respectively.
cluster or row/column labels below and to the left of the heatmap: these labels correspond to the cluster numbers/names or the row/column names.
The package consists of a single function: superheat.
The superheat function takes several data objects, the most important of which are X (the heatmap matrix), yr (a vector of values to be plotted to the right of the heatmap) and yt (a vector of values to be plotted above the heatmap), although both yr and yt are optional. For example, the plot could be generated for the famous iris dataset as follows:
# define a linear model and isolate coefficients for each variable
# (predictors are listed in the same order as the heatmap columns so that yt lines up with them)
iris.coef <- lm(Petal.Length ~ Sepal.Width + Sepal.Length + Petal.Width, data = iris)$coef
# generate the plot:
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")], # heatmap matrix
yr = iris[,"Petal.Length"], #plot Petal.Length to the right
yr.axis.name = "Petal.Length",
membership.rows = iris[,"Species"], # cluster the rows by species
yt = iris.coef[-1], # plot the model coefficients above
yt.plot.type = "bar", # make the plot above a barplot
yt.axis.name = "coefficient")
This plot shows a heatmap of the iris matrix, \(X\), which consists of 150 rows (the first 10 of which are shown below) and 3 variables (the first three columns of the table below). The species variable (the fourth column of the table below) is used as the cluster membership identifier. Finally, on the right-hand side of the plot we plot a fourth variable, Petal.Length. We can see from this plot that the virginica species has longer petals and sepals than setosa, which has short sepals and particularly small petal width.
| Sepal.Width | Sepal.Length | Petal.Width | Species |
|---|---|---|---|
| 3.5 | 5.1 | 0.2 | setosa |
| 3 | 4.9 | 0.2 | setosa |
| 3.2 | 4.7 | 0.2 | setosa |
| 3.1 | 4.6 | 0.2 | setosa |
| 3.6 | 5 | 0.2 | setosa |
| 3.9 | 5.4 | 0.4 | setosa |
| 3.4 | 4.6 | 0.3 | setosa |
| 3.4 | 5 | 0.2 | setosa |
| 2.9 | 4.4 | 0.2 | setosa |
| 3.1 | 4.9 | 0.1 | setosa |
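Before plotting, it can be worth checking that the inputs line up: yr needs one value per row of X and yt one value per column. The sketch below uses base R only (no superheat call) and reproduces the ten rows shown in the table. Selecting the coefficients by name, rather than by position, is a choice made here to guarantee that they follow the column order of X; `vars`, `fit` and `first.ten` are illustrative names.

```r
vars <- c("Sepal.Width", "Sepal.Length", "Petal.Width")

X  <- iris[, vars]               # heatmap matrix: 150 rows, 3 columns
yr <- iris[, "Petal.Length"]     # one value per row of X

# Fit the linear model and pick coefficients by name, which drops the
# intercept and orders them like the columns of X.
fit <- lm(Petal.Length ~ Sepal.Width + Sepal.Length + Petal.Width, data = iris)
yt  <- coef(fit)[vars]           # one value per column of X

# The first ten rows, as shown in the table above.
first.ten <- head(iris[, c(vars, "Species")], 10)
print(first.ten)

# Sanity checks on the shapes.
stopifnot(length(yr) == nrow(X),
          identical(names(yt), colnames(X)))
```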
Below we continue with the iris example and show how to use the numerous options for customizing the plot.
It is easy to add text to each cluster (or variable) cell in the heatmap using the X.text argument.
X.text <- matrix(c("Sepal", "Sepal", "Sepal", "Sepal", "Sepal", "Sepal", "Petal", "Petal", "Petal"), ncol = 3, byrow = T)
X.text
#> [,1] [,2] [,3]
#> [1,] "Sepal" "Sepal" "Sepal"
#> [2,] "Sepal" "Sepal" "Sepal"
#> [3,] "Petal" "Petal" "Petal"
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")], # heatmap matrix
X.text = X.text,
yr = iris[,"Petal.Length"], #plot Petal.Length to the right
yr.axis.name = "Petal.Length",
membership.rows = iris[,"Species"], # cluster the rows by species
yt = iris.coef[-1], # plot the model coefficients above
yt.plot.type = "bar", # make the plot above a barplot
yt.axis.name = "coefficient")
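Rather than fixed words, the text matrix can carry a summary of each cell. The following sketch builds a 3 x 3 character matrix of per-species means with base R. It assumes, as in the example above, that X.text has one row per row cluster and one column per heatmap column; whether the cluster rows appear in factor-level order in the plot is also an assumption to check against your own output. `cluster.means` and `X.text.means` are illustrative names.

```r
vars <- c("Sepal.Width", "Sepal.Length", "Petal.Width")

# Mean of each variable within each species: rows are species, columns are variables.
cluster.means <- sapply(iris[, vars], function(v) tapply(v, iris$Species, mean))

# Format to one decimal place; sprintf() drops the dimensions, so restore them.
X.text.means <- matrix(sprintf("%.1f", cluster.means),
                       nrow = nrow(cluster.means),
                       dimnames = dimnames(cluster.means))
print(X.text.means)
```

A matrix built this way could then be passed as `X.text = X.text.means` in place of the hand-written matrix.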
X.text <- matrix(c("Sepal", "Sepal", "Sepal", "Sepal", "Sepal", "Sepal", "Petal", "Petal", "Petal"),
ncol = 3,
byrow = T)
X.text.size <- matrix(c(5, 5, 10, 5, 5, 15, 5, 5, 5), ncol = 3, byrow = T)
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")], # heatmap matrix
X.text = X.text,
X.text.size = X.text.size,
X.text.angle = 30,
yr = iris[,"Petal.Length"], #plot Petal.Length to the right
yr.axis.name = "Petal.Length",
membership.rows = iris[,"Species"], # cluster the rows by species
yt = iris.coef[-1], # plot the model coefficients above
yt.plot.type = "bar", # make the plot above a barplot
yt.axis.name = "coefficient")
X.text <- matrix(c("Sepal", "Sepal", "Sepal", "Sepal", "Sepal", "Sepal", "Petal", "Petal", "Petal"),
ncol = 3,
byrow = T)
X.text.col <- matrix(c("black", "black", "black", "red", "white", "red", "black", "black", "black"), ncol = 3, byrow = T)
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")], # heatmap matrix
X.text = X.text,
X.text.col = X.text.col,
yr = iris[,"Petal.Length"], #plot Petal.Length to the right
yr.axis.name = "Petal.Length",
membership.rows = iris[,"Species"], # cluster the rows by species
yt = iris.coef[-1], # plot the model coefficients above
yt.plot.type = "bar", # make the plot above a barplot
yt.axis.name = "coefficient")
In the iris example above, we have specified our own cluster vector (the Species variable, which consists of three classes: setosa, versicolor and virginica). However, if you do not have a cluster membership vector in mind (or cannot be bothered to generate one using your clustering algorithm of preference), never fear! The superheat function has in-built clustering options for performing K-means (the default if a membership vector is not supplied) and hierarchical clustering (specify clustering.method = "hierarchical") on the rows and/or columns.
set.seed(134)
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")],
yr = iris[,"Petal.Length"],
yr.axis.name = "Petal.Length",
n.clusters.rows = 3,
clustering.method = "kmeans")
Note that if no membership vectors are supplied, the default is to cluster the rows but not the columns. As a result, the number of row clusters must be supplied (above we have n.clusters.rows = 3), otherwise an error will be produced. If you wish to cluster the columns as well, specify the number of column clusters using the n.clusters.col argument.
If you wanted to keep the variable names as the labels, you could simply specify bottom.label = "variable".
set.seed(134)
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")],
yr = iris[,"Petal.Length"],
yr.axis.name = "Petal.Length",
n.clusters.rows = 3,
n.clusters.col = 2,
bottom.label = "variable")
Moreover, if you have used the in-built clustering algorithms, you can obtain the cluster membership vector by saving the superheat object and accessing the membership.rows element (below we hide the plot itself, which is the same as the previous plot, by specifying print.plot = F).
set.seed(134)
plot <- superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")],
yr = iris[,"Petal.Length"],
yr.axis.name = "Petal.Length",
n.clusters.rows = 3,
print.plot = F)
plot$membership.rows
#> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
#> 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
#> 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> 37 38 39 40 41 42 43 44 45 46 47 48 49 50 58 61 94 52
#> 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2
#> 54 55 56 57 59 60 62 63 64 65 67 68 69 70 71 72 73 74
#> 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
#> 75 76 79 80 81 82 83 84 85 86 88 89 90 91 92 93 95 96
#> 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
#> 97 98 99 100 102 104 107 114 120 122 124 127 128 134 135 138 139 143
#> 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
#> 147 150 51 53 66 77 78 87 101 103 105 106 108 109 110 111 112 113
#> 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
#> 115 116 117 118 119 121 123 125 126 129 130 131 132 133 136 137 140 141
#> 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
#> 142 144 145 146 148 149
#> 3 3 3 3 3 3
We note that the k-means clustering corresponds mostly to the Species variable.
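One way to quantify "mostly" is a cross-tabulation of cluster against species. The sketch below runs `stats::kmeans` directly on the same three columns as a stand-in for the built-in clustering. That is an assumption: the cluster labels and exact assignments need not match those returned by superheat, and `nstart = 25` is a choice made here for stability.

```r
vars <- c("Sepal.Width", "Sepal.Length", "Petal.Width")

set.seed(134)
km <- kmeans(iris[, vars], centers = 3, nstart = 25)

# Rows are k-means clusters, columns are species; a clustering that tracks
# species closely has one dominant count in each row.
agreement <- table(cluster = km$cluster, species = iris$Species)
print(agreement)
```

If you saved the superheat object as above, the same table can be built from `plot$membership.rows` after matching it back to the row order of iris by its names.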
The top/right plots can be one of several types. The default is a scatterplot. The following other options are also available:
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")],
yr = iris[,"Petal.Length"],
membership.rows = iris[,"Species"],
yr.axis.name = "Petal.Length",
yr.plot.type = "scattersmooth")
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")],
yr = iris[,"Petal.Length"],
membership.rows = iris[,"Species"],
yr.axis.name = "Petal.Length",
yr.plot.type = "scattersmooth",
smoothing.method = "lm")
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")],
yr = iris[,"Petal.Length"],
membership.rows = iris[,"Species"],
yr.axis.name = "Petal.Length",
yr.plot.type = "smooth")
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")],
yr = iris[,"Petal.Length"],
membership.rows = iris[,"Species"],
yr.axis.name = "Petal.Length",
yr.plot.type = "line")
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")],
yr = iris[,"Petal.Length"],
membership.rows = iris[,"Species"],
yr.axis.name = "Petal.Length",
yr.plot.type = "scatterline")
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")],
yr = iris[,"Petal.Length"],
membership.rows = iris[,"Species"],
yr.axis.name = "Petal.Length",
yr.plot.type = "bar",
yr.bar.col = "grey")
Cluster-level bar plot (supply a y vector whose length is the number of clusters, rather than the number of rows/columns):
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")],
yr = 1:3,
membership.rows = iris[,"Species"],
yr.axis.name = "Petal.Length",
yr.plot.type = "bar",
yr.bar.col = "grey")
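In practice the cluster-level vector is usually a summary statistic rather than 1:3. A minimal base R sketch that computes the mean petal length per species is given below. It assumes that the values should be ordered like the levels of the membership factor, which is worth verifying against the labels in your plot; `yr.cluster` is an illustrative name.

```r
# Mean Petal.Length within each species, in factor-level order
# (setosa, versicolor, virginica).
cluster.means <- tapply(iris$Petal.Length, iris$Species, mean)
print(cluster.means)

# Plain numeric vector of length 3, one entry per cluster.
yr.cluster <- as.numeric(cluster.means)
stopifnot(length(yr.cluster) == nlevels(iris$Species))
```

This vector could be supplied as `yr = yr.cluster` in the call above in place of `yr = 1:3`.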
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")],
yr = iris[,"Petal.Length"],
membership.rows = iris[,"Species"],
yr.axis.name = "Petal.Length",
yr.plot.type = "boxplot")
Dendrogram (there is no need to supply yr or row clustering information when setting yr.plot.type to "dendrogram"):
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")],
yr.plot.type = "dendrogram")
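To inspect the kind of tree such a dendrogram represents, you can build one with base R. This is a sketch under stated assumptions: it uses Euclidean distance and complete linkage (the defaults of `dist` and `hclust`), which may differ from what superheat uses internally.

```r
vars <- c("Sepal.Width", "Sepal.Length", "Petal.Width")

# Hierarchical clustering of the rows: Euclidean distance, complete linkage.
hc   <- hclust(dist(iris[, vars]))
dend <- as.dendrogram(hc)

# Cut the tree into three groups and compare them with the species.
groups <- cutree(hc, k = 3)
print(table(group = groups, species = iris$Species))
```

Calling `plot(dend)` draws the tree on its own if you want to examine it outside the heatmap.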
It is not necessary to have all components in your plot at all times. For example, if you do not supply a yt vector, no top plot is drawn; similarly, if you do not supply a yr vector, no right plot is drawn. Further options include:
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")],
yr = iris[,"Petal.Length"],
yr.axis.name = "Petal.Length",
membership.rows = iris[,"Species"],
legend = FALSE)
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")],
yr = iris[,"Petal.Length"],
yr.axis.name = "Petal.Length",
membership.rows = iris[,"Species"],
left.label = "none")
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")],
yr = iris[,"Petal.Length"],
yr.axis.name = "Petal.Length",
membership.rows = iris[,"Species"],
bottom.label = "none")
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")],
yr = iris[,"Petal.Length"],
membership.rows = iris[,"Species"],
yr.axis = F)
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")],
yr = iris[,"Petal.Length"],
membership.rows = iris[,"Species"],
yr.axis.name = "Petal.Length",
title = "I'm a title!")
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")],
yr = iris[,"Petal.Length"],
membership.rows = iris[,"Species"],
yr.axis.name = "Petal.Length",
column.title = "I'm a column name!")
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")],
yr = iris[,"Petal.Length"],
membership.rows = iris[,"Species"],
yr.axis.name = "Petal.Length",
row.title = "I'm a row name!")
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")],
yr = iris[,"Petal.Length"],
membership.rows = iris[,"Species"],
yr.axis.name = "Petal.Length",
padding = 2.5)
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")],
yr = iris[,"Petal.Length"],
membership.rows = iris[,"Species"],
yr.axis.name = "Petal.Length",
grid.hline = F)
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")],
yr = iris[,"Petal.Length"],
membership.rows = iris[,"Species"],
yr.axis.name = "Petal.Length",
grid.vline = F)
The default colour palettes are as shown in the above examples; however, you can specify your own:
Select a different built-in colour scheme using the heat.col.scheme argument.
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")],
yr = iris[,"Petal.Length"],
membership.rows = iris[,"Species"],
yr.axis.name = "Petal.Length",
heat.col.scheme = "blue")
Specify your own palette using the heat.pal argument: you can specify as many colours as you want.
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")],
yr = iris[,"Petal.Length"],
membership.rows = iris[,"Species"],
yr.axis.name = "Petal.Length",
heat.pal = c("yellow", "brown", "blue"))
Obtain a diverging colour scheme by using the heat.pal argument and selecting a diverging palette (e.g. the Red/Blue palette from the RColorBrewer package).
library(RColorBrewer)
scaled_iris <- scale(iris[,c("Sepal.Width","Sepal.Length","Petal.Width")])
superheat(X = scaled_iris,
yr = iris[,"Petal.Length"],
membership.rows = iris[,"Species"],
yr.axis.name = "Petal.Length",
heat.pal = brewer.pal(8, "RdBu"))
Control where each colour falls using the heat.pal.values argument. The length of heat.pal.values must be the same as the length of the heat.pal argument.
library(RColorBrewer)
scaled_iris <- scale(iris[,c("Sepal.Width","Sepal.Length","Petal.Width")])
superheat(X = scaled_iris,
yr = iris[,"Petal.Length"],
membership.rows = iris[,"Species"],
yr.axis.name = "Petal.Length",
heat.pal = brewer.pal(7, "RdBu"), # 7 colours, matching the 7 values below
heat.pal.values = c(0, 0.2, 0.25, 0.3, 0.7, 0.8, 1))
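With a diverging palette on scaled data, a natural choice is to pin the middle colour to zero. The sketch below computes such positions with base R. It assumes that heat.pal.values are positions on a 0 to 1 rescaling of the data range (as the 0 to 1 values in the example suggest), and it uses `colorRampPalette` instead of RColorBrewer only to avoid an extra dependency; `zero.pos` is an illustrative name.

```r
vars <- c("Sepal.Width", "Sepal.Length", "Petal.Width")
scaled_iris <- scale(iris[, vars])

# Where does the value 0 sit once the data range is rescaled to [0, 1]?
rng <- range(scaled_iris)
zero.pos <- (0 - rng[1]) / (rng[2] - rng[1])

# Seven colours running from red through white to blue; the 4th is the midpoint.
heat.pal <- colorRampPalette(c("red", "white", "blue"))(7)

# Seven positions: four from 0 up to zero.pos, then three more up to 1,
# so that the middle colour lands on zero.
heat.pal.values <- c(seq(0, zero.pos, length.out = 4),
                     seq(zero.pos, 1, length.out = 4)[-1])

stopifnot(length(heat.pal.values) == length(heat.pal))
print(round(heat.pal.values, 3))
```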
Colour individual points in the right scatterplot (replace yr with yt for the top scatterplot): note that the order of the points in the yr.obs.col vector corresponds to the order of the points in yr.
col.vec <- rep("black", 150)
col.vec[50:70] <- "red"
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")],
yr = iris[,"Petal.Length"],
membership.rows = iris[,"Species"],
yr.axis.name = "Petal.Length",
yr.obs.col = col.vec)
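The colour vector can also be driven by a condition on the data instead of by row index. A minimal sketch in base R follows; the threshold of 6 is arbitrary and chosen only for illustration.

```r
# One colour per observation, in the same order as yr = iris[, "Petal.Length"].
col.vec <- ifelse(iris$Petal.Length > 6, "red", "black")

# The vector must have one entry per point in yr.
stopifnot(length(col.vec) == nrow(iris))
print(table(col.vec))
```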
Use yr.cluster.col to colour the points by cluster in yr.
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")],
yr = iris[,"Petal.Length"],
membership.rows = iris[,"Species"],
yr.axis.name = "Petal.Length",
yr.cluster.col = c("red","blue","orange"))
Change the colours of the left labels (for the bottom labels, use bottom.label.col and bottom.label.text.col):
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")],
yr = iris[,"Petal.Length"],
membership.rows = iris[,"Species"],
yr.axis.name = "Petal.Length",
left.label.col = c("blue","red","purple"),
left.label.text.col = c("white","black","white"))
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")],
yr = iris[,"Petal.Length"],
membership.rows = iris[,"Species"],
yr.axis.name = "Petal.Length",
grid.hline.col = "white",
grid.vline.col = "red")
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")],
yr = iris[,"Petal.Length"],
membership.rows = iris[,"Species"],
yr.axis.name = "Petal.Length",
smooth.heat = TRUE)
There are a number of options to change the text size, point size and legend size in the plot. For example, you can specify the following arguments in the superheat function:
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")],
yr = iris[,"Petal.Length"],
membership.rows = iris[,"Species"],
yr.axis.name = "Petal.Length",
legend.width = 2,
legend.height = 0.2)
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")],
yr = iris[,"Petal.Length"],
membership.rows = iris[,"Species"],
yr.axis.name = "Petal.Length",
bottom.label.size = 0.4,
bottom.label.text.angle = 45,
left.label.size = 0.4,
left.label.text.angle = 0)
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")],
yr = iris[,"Petal.Length"],
membership.rows = iris[,"Species"],
yr.axis.name = "Petal.Length",
left.label.size = 0.3,
left.label.text.alignment = "right",
left.label.text.angle = 0,
bottom.label.size = 0.5,
bottom.label.text.alignment = "right",
bottom.label.text.angle = 90)
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")],
yr = iris[,"Petal.Length"],
membership.rows = iris[,"Species"],
yr.axis.name = "Petal.Length",
left.label.text.size = 7,
bottom.label.text.size = 2)
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")],
yr = iris[,"Petal.Length"],
membership.rows = iris[,"Species"],
yr.axis.name = "Petal.Length",
yr.axis.size = 15,
yr.axis.name.angle = 10,
yr.axis.name.size = 20)
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")],
yr = iris[,"Petal.Length"],
membership.rows = iris[,"Species"],
yr.axis.name = "Petal.Length",
yr.point.size = 3.5)
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")],
yr = iris[,"Petal.Length"],
membership.rows = iris[,"Species"],
yr.axis.name = "Petal.Length",
grid.vline.size = 2,
grid.hline.size = 2)
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")],
yr = iris[,"Petal.Length"],
membership.rows = iris[,"Species"],
yr.axis.name = "Petal.Length",
order.rows = order(iris$Sepal.Width))
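The order.rows example above relies on `order()`, which returns the permutation of row indices that sorts a vector. The sketch below shows what that permutation looks like in base R. How superheat combines it with a membership vector is not shown here and is left to the package.

```r
# Indices of the rows of iris, arranged from smallest to largest Sepal.Width.
ord <- order(iris$Sepal.Width)
print(head(ord))

# Applying the permutation gives the sorted values.
sorted.width <- iris$Sepal.Width[ord]
stopifnot(!is.unsorted(sorted.width))

# For the reverse ordering, ask for a decreasing sort.
ord.desc <- order(iris$Sepal.Width, decreasing = TRUE)
```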
The best format for saving superheat images is as a .png file. To do this in R, the easiest way is to invoke the png() function (remember to call dev.off() when you’re done!)
png("superheat.png", height = 500, width = 800)
superheat(X = iris[,c("Sepal.Width","Sepal.Length","Petal.Width")],
yr = iris[,"Petal.Length"],
membership.rows = iris[,"Species"],
yr.axis.name = "Petal.Length",
order.rows = order(iris$Sepal.Width))
dev.off()
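If the plotting call fails partway through, the graphics device stays open unless dev.off() is still called. One way to guard against that is a small wrapper with `on.exit`. This is a sketch, not part of superheat: `save_png` is a hypothetical helper, and the usage example draws a base R plot so that it runs without the package installed.

```r
# Hypothetical helper: open a png device, run the plotting function,
# and always close the device, even if plotting raises an error.
save_png <- function(file, plot.fun, width = 800, height = 500) {
  png(file, width = width, height = height)
  on.exit(dev.off(), add = TRUE)
  plot.fun()
  invisible(file)
}

# Usage with a base R plot (only attempted if a png device is available).
if (capabilities("png")) {
  out.file <- tempfile(fileext = ".png")
  save_png(out.file, function() plot(iris$Sepal.Width, iris$Petal.Length))
  print(file.exists(out.file))
}
```

A superheat call can be wrapped in the same way by placing it inside the function passed as `plot.fun`.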
Thanks for using the package! I hope you find it helpful in your data exploration adventures. The GitHub development page can be found at https://github.com/rlbarter/superheat. For pull requests and suggestions, please follow the standard protocol. For pressing questions or comments, feel free to email me at rebeccabarter@berkeley.edu.